Acute kidney injury (AKI) is one of the most common complications seen in hospitalized patients. The ability to predict which patients are at highest risk for AKI would allow clinicians to intervene and prevent poor patient outcomes. Despite many attempts, however, AKI predicting models fail to perform in way that motivates their use in clinical decision support. Understanding which risk factors contribute most to AKI prediction, and how those risk factors vary based on ICU status and admission type, can inform future studies seeking to predict AKI. This study aims to quantify the predictive power of various risk factors for AKI based on admission type and patient location in the hospital.
A common condition in hospitalized patients, rates of AKI are estimated to be as high as 7% in hospitalized patients and 30% in ICU patients. (Goyal et al.). AKI is a sudden loss of renal function due to non-renal causes, such as dehydration, reduced blood flow to the kidney, nephrotoxic medications, or sepsis. AKI is usually reversible, but if not addressed can progress to permanent kidney damage and multi-organ failure. Diagnosing AKI early can prevent poor patient outcomes (Goyal et al.). Many groups have attempted to build decision making tools to assist clinicians in predicting AKIs early in the disease stage allowing rapid intervention and preventing sequelae. However, the majority of AKI prediction tools, including many using state of the art machine learning methods on large, diverse datasets, struggle to improve on a trained clinicians ability to identify AKIs (De Vlieger et al). In particular, the black box issue of machine learning algorithms limits the clinicians ability to trust the models, understand what is happening to the patient, and base decisions on model outputs. One way to enhance model transparency, while improving performance is to understand which risk factors are important to model decision making. Understanding which risk factors contribute most to AKI prediction, and how those risk factors vary based on ICU status and admission type can inform future work seeking to predict AKI.
Prior research has built a data set of patients on combination anti-hypertensive with either NSAID or oxycodone to assess AKI risk (Miano et al). This dataset can be re-analyzed to identify which of the studied risk factors contributes most to prediction of AKI. Patients on combination antihypertensives and either NAISDs or oxycodone represent a common subset of hospitalized patients, thus results on this population can be expected to generalize well to all patients. This study seeks to answer the question: in patients receiving treatment for hypertension and pain, do the most predictive risk factors for AKI differ by ICU vs non-ICU and admission type? A secondary question explored by this study is: do models perform better when continuous variables are treated as categorical variables?
AKI prediction is a prominent interdisciplinary space. Clinicians, informaticians, computer scientists, and epidemiologists frequently work together and publish in this space. The work typically combines new machine learning approaches with classic clinical definitions of kidney injury and risk factors. No discipline alone has enough expertise to address the problem, making interdisciplinary collaboration key to success. Conversations with Todd Miano (PharmD, PhD research at University of Pennsylvania’s Department of Biostatistics, Epidemiology, and Bioinformatics (DBEI) and Critical Care Pharmacist) highlighted the difficulty in defining AKI though KDIGO’s multipart definition based on the non-specific marker, serum creatinine. Informatician, John Holmes (Professor of Medical Bioinformatics in Epidemiology also at the University of Pennsylvania DBEI) emphasized the fact that the various prediction models incorporate similar, large sets of risk factors. He suggested that different risk factors my play varying roles for patients in different clinical settings, such as patients in the ICU or post-operative patients. These discussions motivated the proposed study of quantifying predictive power of various risk factors, but splitting the analysis based on admission type and patient location.
A git repo for this project is available here.
Todd Miano provided the data used in this project, and was previously described in Miano et al. Briefly, data was collected on 27,741 adult patients with greater than 24-hours exposure to one both an anti-hypertensive agent and an analgesic. Ant-hypertensives of interest included either a renin-angiotensin inhibitor (RAS-I) or the calcium channel blocker amlodipine. Analgesics of interest included wither an NSAID or oxycodone. Data was collected between 2014 and 2017 from patients admitted to University of Pennsylvania Health System hospitals using the Penn Data Store warehouse for electronic health records in the health system. Patients were excluded if they had contraindications to NSAIDs, including unresolved AKI within 2 weeks before entry, baseline serum creatinine >2mg/dl (an indicator of renal injury), end stage renal disease, renal replacement therapy, platelet count \(< 100*10^{11}\) (risk of bleed), pregnancy (a contraindication to RAS-I treatment), lack of baseline or follow-up serum creatinine, and history of solid organ transplant. Information was also collected variables associated with AKI, including cardiovascular conditions, diabetes, and cancer. Laboratory values and presence of nephrotoxic medications at entry are also included. In this data set, AKI’s and AKI severity were labeled according to Kidney Disease Improving Global Outcomes (KDIGO) criteria.
In this data set, comorbid conditions and laboratory values were assessed at the point in time where the patient qualified for the study (entry). That is when the patient had greater than 24-hours exposure to the drug-combinations of interest. Thus, when using this data to predict AKI, we necessarily are predicting AKI at the same point in time. Therefore, the prediction methods evaluated in this study attempt to predict AKIs using information available to clinicians up to 24 hours after initiating combination therapy of anti-hypertensives and opioids. The goal of this research is to identify which predictors are the most helpful in predicting AKIs in this setting, and how they differ, if at all, between the medical floor and the ICU.
To do this, the dataset was imported and examined. There were no missing data. However, there were abnormally high and low BMI values. Subjects with BMI <12 and >60 were removed. Next, the data were separated it into 1) the complete dataset, 2) ICU cohort, and 3) the medical floor cohort.
A second set of data with continuous variables categorized by reference ranges was created. The continuous variables BMI, white blood cell count, hemoglobin, platelets, sodium, potassium, chloride, creatinine, and index GFR were categorized based on reference ranges (CDC, Kratz 2022;, UPenn, KDIGO). The categorical variables age was categorized as 18-40, 40-65, and greater than 65. Prior length of stay was grouped as zero days, 1-2 days, 3-7 days, and greater than seven days. Age and prior length of stay ranges were chosen arbitrarily.
Generalized logistic regression models were created for each data sub-set. K-fold cross validation were used to estimate prediction error. K=10 random sub-samples randomly created and K-1=9 subsets were used for training, one subset was used for testing. Area under the ROC and area under the precision recall curves were generated to evaluate model performance.
Predictor importance for logistic regression models were identified based on test-statistic magnitude and statistical significance.
Next, decision tree models were trained for each model. K-fold cross validation were used to estimate prediction error. K=10 random sub-samples randomly created and K-1=9 subsets were used for training, one subset was used for testing. To build the forest, 100 trees were used. Area under the ROC and area under the precision recall curves were generated to evaluate model performance.
Predictor importance was identified based on magnitude of mean decrease in GINI importance scores. Importantly, mean decrease in GINI importance scores are known to be inflated when involving continuous variables and categorical variables with many categories. When calculating GINI scores, the GINI index is calculated for each cut-point in a classification tree. Variables that are continuous or have many categories will have many more potential cut-points than dichotomous variables and therefore have a higher probability of coincidentally producing a better cut-point and getting a higher importance score (Nembrini, 2018; Strobl, 2007). To evaluate the effect of continuous variables on mean decrease in GINI scores for this study, the analysis was repeated using the dataset with all continuous variables categorized according to reference ranges (see above).
# Import packages, if needed
install.packages("dplyr")
install.packages("gtsummary")
install.packages("modelsummary")# Load packages
library(dplyr)##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(gtsummary)
library(modelsummary)
library(ggplot2)
library(randomForest)## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(tibble)
library(tidyr)
library(stringr)
library(pROC)## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(PRROC)Data were uploaded to Zenodo and are available here.
#Load the data
data <- read.csv("/Users/haedi/Repos/BMIN503_Final_Project/BMIN503_Final_Project/miano_thelen_dataset.csv", header = TRUE)
# Drop columns not needed:
# ddiGroup was needed to classify exposure in the prior study, but not needed
# here, nsaidType is missing data for 81% and so was left out.
# aki_stage, rrt, and mortHosp30 are outcome variables needed in the prior study
# but not here; we are only interested in binary AKI as the outcome variable.
# Table 1, sorted by ICU stay, variables renamed for readability
data %>%
select(!c(pid, ddiGroup, nsaidType, aki_stage, rrt, mortHosp30)) %>%
filter(bmi>12 & bmi<60) %>%
rename(Analgesic = pain, Anithypertensive = bp, AKI = aki, Age = age,
Sex = sex, Race = race, BMI = bmi, "Admission Type" = admType,
"Prior Length of Stay" = priorLos, "Perioperative Day" = periOp,
Ventilator = vent, "Congestive Heart Failure" = chf,
"Chronic Pulmonary Disease" = cpd,
HIV = hiv, "Liver Disease" = liver, "periveral Vascular Disease" = pvd,
"Ceribrovascular Disease" = cva, "Myocardial Infarction" = mif,
"Valvular Disease" = valve, Hypertension = htn,
"Cardiac Arrhythmias" = arry,
"Pulmonary Circulation Disorder" = pCirc, Obesity = obese,
"Weight Loss" = wtLoss,
"Fluid and Electrolyte Disorder"= fluid, "Chronic Kidney Disease" = ckd,
"Atrial Fibrillation" = afib, "Obstructive Sleep Apnea" = osa,
"Diabetes Mellitus" = dm, Cancer = cancer, "Prior AKI" = prior_aki,
Vancomycin = vanco,
"Bactrim" = bactrim, "Vasopressors" = pressor,
"Other Nephrotoxins" = ntxOther,
"Nephrotoxic Antibiotics" = abxNTX, "Diruetics" = diuretic,
"Broad Spectrum Antibiotics" = gramNegBroad,
"Narrow Spectrum Antibiotics" = gramNegNarrow,
"WBC x10^9cells/L" = wbc, "Hmoglobin g/dl" = hgb,
"Platelets x10^11/L" = platelets, "Sodium, mEq/L" = sodium,
"Potassium, mEq/L" = potassium, "Chloride mEq/L" = chloride,
"Serum Creatinine at entry" = creatinine, "Baseline GFR" = indexGFR) %>%
mutate(Cancer = factor(Cancer, level = c(0,1,2),
labels = c("None", "Non-metastatic", "Metastatic"))) %>%
mutate(icu = factor(icu, levels = c(0,1), labels = c("non-ICU", "ICU"))) %>%
tbl_summary(by = icu)| Characteristic | non-ICU, N = 23,0311 | ICU, N = 4,4851 |
|---|---|---|
| Analgesic | ||
| NSAID | 4,667 (20%) | 704 (16%) |
| Oxycodone | 18,364 (80%) | 3,781 (84%) |
| Anithypertensive | ||
| CCB | 4,792 (21%) | 1,045 (23%) |
| RAS | 18,239 (79%) | 3,440 (77%) |
| AKI | 1,867 (8.1%) | 585 (13%) |
| Age | 64 (55, 72) | 63 (54, 71) |
| Sex | ||
| Female | 11,304 (49%) | 1,694 (38%) |
| Male | 11,727 (51%) | 2,791 (62%) |
| Race | ||
| Black | 8,916 (39%) | 1,125 (25%) |
| Other / Unk | 2,145 (9.3%) | 575 (13%) |
| White | 11,970 (52%) | 2,785 (62%) |
| BMI | 30 (26, 35) | 29 (25, 34) |
| Admission Type | ||
| Medicine | 9,335 (41%) | 1,039 (23%) |
| Surgery | 13,696 (59%) | 3,446 (77%) |
| Prior Length of Stay | 2 (1, 3) | 2 (1, 4) |
| Perioperative Day | ||
| Not post-op | 17,474 (76%) | 2,860 (64%) |
| POD one | 2,503 (11%) | 978 (22%) |
| POD three | 736 (3.2%) | 173 (3.9%) |
| POD two | 1,907 (8.3%) | 371 (8.3%) |
| POD zero | 411 (1.8%) | 103 (2.3%) |
| Ventilator | 340 (1.5%) | 682 (15%) |
| Congestive Heart Failure | 5,696 (25%) | 1,799 (40%) |
| Chronic Pulmonary Disease | 6,411 (28%) | 1,478 (33%) |
| HIV | 347 (1.5%) | 37 (0.8%) |
| Liver Disease | 1,184 (5.1%) | 212 (4.7%) |
| periveral Vascular Disease | 3,429 (15%) | 1,092 (24%) |
| Ceribrovascular Disease | 2,259 (9.8%) | 646 (14%) |
| Myocardial Infarction | 3,251 (14%) | 1,206 (27%) |
| Valvular Disease | 3,476 (15%) | 1,232 (27%) |
| Hypertension | 20,664 (90%) | 3,777 (84%) |
| Cardiac Arrhythmias | 4,477 (19%) | 1,359 (30%) |
| Pulmonary Circulation Disorder | 2,078 (9.0%) | 640 (14%) |
| Obesity | 5,161 (22%) | 871 (19%) |
| Weight Loss | 1,418 (6.2%) | 429 (9.6%) |
| Fluid and Electrolyte Disorder | 5,783 (25%) | 1,640 (37%) |
| Chronic Kidney Disease | 2,909 (13%) | 602 (13%) |
| Atrial Fibrillation | 3,913 (17%) | 1,181 (26%) |
| Obstructive Sleep Apnea | 3,362 (15%) | 664 (15%) |
| Diabetes Mellitus | ||
| DM-Comp | 2,004 (8.7%) | 339 (7.6%) |
| DM-Non-comp | 6,454 (28%) | 1,209 (27%) |
| None | 14,573 (63%) | 2,937 (65%) |
| Cancer | ||
| None | 18,747 (81%) | 3,851 (86%) |
| Non-metastatic | 2,912 (13%) | 441 (9.8%) |
| Metastatic | 1,372 (6.0%) | 193 (4.3%) |
| Prior AKI | ||
| None | 20,906 (91%) | 3,748 (84%) |
| Yes-resolved | 2,125 (9.2%) | 737 (16%) |
| Vancomycin | 4,702 (20%) | 1,555 (35%) |
| Bactrim | 713 (3.1%) | 66 (1.5%) |
| Vasopressors | 541 (2.3%) | 643 (14%) |
| Other Nephrotoxins | 526 (2.3%) | 64 (1.4%) |
| Nephrotoxic Antibiotics | 555 (2.4%) | 363 (8.1%) |
| Diruetics | 9,193 (40%) | 2,298 (51%) |
| Broad Spectrum Antibiotics | 2,717 (12%) | 641 (14%) |
| Narrow Spectrum Antibiotics | 9,072 (39%) | 2,047 (46%) |
| WBC x10^9cells/L | 8.9 (6.9, 11.4) | 10.8 (8.3, 14.0) |
| Hmoglobin g/dl | 11.00 (9.60, 12.40) | 11.00 (9.60, 12.40) |
| Platelets x10^11/L | 221 (173, 282) | 194 (149, 251) |
| Sodium, mEq/L | 138.0 (135.0, 140.0) | 138.0 (136.0, 140.0) |
| Potassium, mEq/L | 4.10 (3.80, 4.40) | 4.10 (3.80, 4.40) |
| Chloride mEq/L | 103.0 (101.0, 106.0) | 105.0 (102.0, 109.0) |
| Serum Creatinine at entry | 0.90 (0.72, 1.10) | 0.90 (0.71, 1.08) |
| Baseline GFR | 72 (57, 88) | 75 (59, 89) |
| 1 n (%); Median (IQR) | ||
# Mutate binary variables to factors, keep only variables we will use
data.c.f <- data %>%
select(!c(pid, ddiGroup, nsaidType, aki_stage, rrt, mortHosp30,)) %>%
filter(bmi>12 & bmi<60) %>%
mutate(across('race', str_replace, 'Other / Unk', 'Other')) %>%
mutate(aki = factor(aki, level = c(0,1), labels = c("No AKI", "AKI"))) %>%
mutate(cancer = factor(cancer, level = c(0,1,2),
labels = c("no_cancer", "non-metastatic", "metastatic"))) %>%
mutate(icu = factor(icu, levels = c(0,1), labels = c("non-ICU", "ICU"))) %>%
mutate_at(c("pain", "bp", "sex", "race", "admType", "periOp", "vent", "chf",
"cpd", "hiv","liver", "pvd", "cva", "mif", "valve", "htn", "arry",
"pCirc", "obese", "wtLoss", "fluid", "ckd", "afib", "osa", "dm",
"prior_aki", "vanco", "bactrim", "pressor","ntxOther", "abxNTX",
"diuretic","gramNegBroad", "gramNegNarrow"), factor) %>%
relocate(where(is.factor))%>%
relocate(aki)
# Break up the datasets to ICU and Non-ICU cohorts
icu <- data.c.f %>%
filter(icu=="ICU") %>%
select(!c(icu))
nonicu <- data.c.f%>%
filter(icu=="non-ICU")%>%
select(!c(icu))Table 1 shows patient characteristics in the ICU vs non-ICU.
Next, we build the categorical dataset.
# Categorizing the continuous variables using reference ranges
data.c.f.cat = data.c.f %>%
mutate(age = cut(age, breaks = c(18, 40, 65, Inf),
labels = c("18-40", "41-65", ">65"),
include.lowest = T)) %>%
mutate(bmi = cut(bmi, breaks = c(-Inf, 18.5, 25, 30, Inf), # CDC
labels = c("Underweight", "Healthy-weight",
"Overweight","Obesity"))) %>%
mutate(priorLos = cut(priorLos, breaks = c(-Inf, 0, 2, 7, Inf),
labels = c("0", "1-2", "3-7", ">7"),
inclue.lowest = F)) %>%
mutate(wbc = cut(wbc, breaks = c(-Inf, 3.5, 9.1, Inf), # Harrison's
labels = c("low", "ref", "high"))) %>%
mutate(hgb = cut(hgb, breaks = c(-Inf,12, 16, Inf), # Harrison's, combo M&F
labels = c("low", "ref", "high"))) %>%
mutate(platelets = cut(platelets, breaks = c(-Inf, 149, 400, Inf),# Upenn
labels = c("low", "ref", "high"))) %>%
mutate(sodium = cut(sodium, breaks = c(-Inf, 135, 144, Inf),# Upenn
labels = c("low", "ref", "high"))) %>%
mutate(potassium = cut(potassium, breaks = c(-Inf, 3.5, 5.1, Inf),# Upenn
labels = c("low", "ref", "high"))) %>%
mutate(chloride = cut(chloride, breaks = c(-Inf, 100, 111, Inf),# Upenn
labels = c("low", "ref", "high"))) %>%
mutate(creatinine = cut(creatinine, breaks = c(-Inf, 0.49, 1.2, Inf),
# Harrison's, combo M&F
labels = c("low", "ref", "high"))) %>%
mutate(indexGFR = cut(indexGFR, breaks = c(-Inf, 15, 30, 45, 60, 90, Inf),
labels = c("G5", "G4", "G3b", "G3a", "G2", "G1")))
# create ICU and non-ICU subsets
icu.cat <- data.c.f.cat %>%
filter(icu=="ICU") %>%
select(!c(icu))
nonicu.cat <- data.c.f.cat%>%
filter(icu=="non-ICU")%>%
select(!c(icu))
# Table 2, continuous variables categorized
data.c.f.cat %>%
select(c(age, bmi, priorLos, wbc, hgb, platelets, sodium, potassium,
chloride, creatinine, indexGFR, icu)) %>%
rename( Age = age, BMI = bmi, "Prior Length of Stay" = priorLos,
"WBC x10^9cells/L" = wbc, "Hmoglobin g/dl" = hgb,
"Platelets x10^11/L" = platelets, "Sodium, mEq/L" = sodium,
"Potassium, mEq/L" = potassium, "Chloride mEq/L" = chloride,
"Serum Creatinine at entry" = creatinine, "Baseline eGFR" = indexGFR) %>%
tbl_summary(by = icu)| Characteristic | non-ICU, N = 23,0311 | ICU, N = 4,4851 |
|---|---|---|
| Age | ||
| 18-40 | 1,029 (4.5%) | 256 (5.7%) |
| 41-65 | 11,749 (51%) | 2,391 (53%) |
| >65 | 10,253 (45%) | 1,838 (41%) |
| BMI | ||
| Underweight | 545 (2.4%) | 115 (2.6%) |
| Healthy-weight | 4,544 (20%) | 1,017 (23%) |
| Overweight | 6,640 (29%) | 1,404 (31%) |
| Obesity | 11,302 (49%) | 1,949 (43%) |
| Prior Length of Stay | ||
| 0 | 1,580 (6.9%) | 273 (6.1%) |
| 1-2 | 14,317 (62%) | 2,494 (56%) |
| 3-7 | 5,390 (23%) | 1,253 (28%) |
| >7 | 1,744 (7.6%) | 465 (10%) |
| WBC x10^9cells/L | ||
| low | 364 (1.6%) | 13 (0.3%) |
| ref | 11,437 (50%) | 1,454 (32%) |
| high | 11,230 (49%) | 3,018 (67%) |
| Hmoglobin g/dl | ||
| low | 16,087 (70%) | 3,067 (68%) |
| ref | 6,768 (29%) | 1,384 (31%) |
| high | 176 (0.8%) | 34 (0.8%) |
| Platelets x10^11/L | ||
| low | 3,153 (14%) | 1,127 (25%) |
| ref | 18,312 (80%) | 3,176 (71%) |
| high | 1,566 (6.8%) | 182 (4.1%) |
| Sodium, mEq/L | ||
| low | 5,913 (26%) | 1,026 (23%) |
| ref | 16,814 (73%) | 3,247 (72%) |
| high | 304 (1.3%) | 212 (4.7%) |
| Potassium, mEq/L | ||
| low | 2,643 (11%) | 485 (11%) |
| ref | 20,010 (87%) | 3,930 (88%) |
| high | 378 (1.6%) | 70 (1.6%) |
| Chloride mEq/L | ||
| low | 5,387 (23%) | 761 (17%) |
| ref | 16,986 (74%) | 3,188 (71%) |
| high | 658 (2.9%) | 536 (12%) |
| Serum Creatinine at entry | ||
| low | 446 (1.9%) | 148 (3.3%) |
| ref | 19,138 (83%) | 3,736 (83%) |
| high | 3,447 (15%) | 601 (13%) |
| Baseline eGFR | ||
| G5 | 0 (0%) | 0 (0%) |
| G4 | 308 (1.3%) | 35 (0.8%) |
| G3b | 2,174 (9.4%) | 384 (8.6%) |
| G3a | 4,328 (19%) | 766 (17%) |
| G2 | 11,113 (48%) | 2,216 (49%) |
| G1 | 5,108 (22%) | 1,084 (24%) |
| 1 n (%) | ||
Table 2 shows categorized continuous variables for patient characteristics in the ICU vs non-ICU.
Now that the datasets are created, we can build the models and identify the variables that have the best predictive value. We first evaluate the complete, ICU, and non-ICU datasets for both logistic regression and random forest with continuous variables. We then repeat the analysis using categorical variables only.
From Table 1, we can see that AKI rates are higher in the ICU. Next, we visualize the relationship between ICU stay and AKI.
ggplot(data = data.c.f, aes(x=icu, fill = aki)) +
geom_bar(position = "fill") +
labs(title = "Proportion of study patients with AKI by location",
x = "ICU Status", y ="Proporiton" )Figure 1 Proportion of study patients with AKI in the ICU vs non-ICU.
Using a Chi-square test, we can check if there is a statistically significant difference in AKI rates.
# Chi-Square test
chisq.test(table(data.c.f$aki, data.c.f$icu))##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(data.c.f$aki, data.c.f$icu)
## X-squared = 112.12, df = 1, p-value < 2.2e-16
At the \(\alpha = 0.5\) level, we can conclude that in our study population, the AKI rates are different between non-ICU and ICU settings. This is helpful contextual information for our data, though, it is not part of the main analysis.
Next we create a logistic regression and random forest models for all three data subsets (complete, ICU, and non-ICU), then compare the predictors.
N = nrow(data.c.f)
K = 10
set.seed(1234)
s = sample(1:K, size = N, replace = T)
pred.outputs.glm <- vector(mode = "numeric", length = N)
pred.outputs.rf <- vector(mode = "numeric", length = N)
obs.outputs <- vector(mode = "numeric", length = N)
offset <- 0
for(i in 1:K){
train <-filter(data.c.f, s != i)
test <- filter(data.c.f, s == i)
obs.outputs[1:length(s[s == i]) + offset] <- test$aki
#GLM train/test
glm.aki <- glm(aki ~ ., data = train, family = binomial(logit))
glm.predictions <- predict(glm.aki, test, type = "response")
pred.outputs.glm[1:length(s[s == i]) + offset] <- glm.predictions
#RF train/test
rf <- randomForest(aki ~ ., data = train, ntree = 100, importance = TRUE)
rf.pred.curr <- predict(rf, newdata = test, type = "prob")
pred.outputs.rf[1:length(s[s == i]) + offset] <- rf.pred.curr[ , 2]
offset <- offset + length(s[s == i])
}# AUC
roc(obs.outputs, pred.outputs.glm) # glm with cross validation## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
##
## Call:
## roc.default(response = obs.outputs, predictor = pred.outputs.glm)
##
## Data: pred.outputs.glm in 25064 controls (obs.outputs 1) < 2452 cases (obs.outputs 2).
## Area under the curve: 0.6924
roc(obs.outputs, pred.outputs.rf, ci = TRUE) # rf with cross validation## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
##
## Call:
## roc.default(response = obs.outputs, predictor = pred.outputs.rf, ci = TRUE)
##
## Data: pred.outputs.rf in 25064 controls (obs.outputs 1) < 2452 cases (obs.outputs 2).
## Area under the curve: 0.6853
## 95% CI: 0.6743-0.6963 (DeLong)
# ROC Curves
plot.roc(obs.outputs, pred.outputs.glm, col = "darkred") # Glm, k fold cross validated## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
plot.roc(obs.outputs, pred.outputs.rf, ci = TRUE, col = "navy", add = TRUE) # RF, K-fold cross validated## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
legend("bottomright", legend = c("glm cross-validation", "rf cross-validation"), col = c("darkred", "navy"), lwd = 1)Figure 2 ROC Curve for the Complete Dataset with Continuous Data
# PR Curves
pr <- pr.curve(pred.outputs.glm[obs.outputs == "2"],
pred.outputs.glm[obs.outputs == "1"], curve=TRUE)
plot(pr)Figure 3 PR Curve for the Logistic Regression Model for the Complete Dataset with Continuous Data
prrf <- pr.curve(pred.outputs.rf[obs.outputs == "2"],
pred.outputs.rf[obs.outputs == "1"], curve=TRUE)
plot(prrf)Figure 4 PR Curve for the Random Forest Model for the Complete Dataset with Continuous Data
N = nrow(icu)
K = 10
set.seed(1234)
s = sample(1:K, size = N, replace = T)
pred.outputs.glm.icu <- vector(mode = "numeric", length = N)
pred.outputs.rf.icu <- vector(mode = "numeric", length = N)
obs.outputs.icu <- vector(mode = "numeric", length = N)
offset <- 0
for(i in 1:K){
train <-filter(icu, s != i)
test <- filter(icu, s == i)
obs.outputs.icu[1:length(s[s == i]) + offset] <- test$aki
#GLM train/test
glm.aki.icu <- glm(aki ~ ., data = train, family = binomial(logit))
glm.predictions.icu <- predict(glm.aki.icu, test, type = "response")
pred.outputs.glm.icu[1:length(s[s == i]) + offset] <- glm.predictions.icu
#RF train/test
rf.icu <- randomForest(aki ~ ., data = train, ntree = 100, importance = TRUE)
rf.pred.curr.icu <- predict(rf.icu, newdata = test, type = "prob")
pred.outputs.rf.icu[1:length(s[s == i]) + offset] <- rf.pred.curr.icu[ , 2]
offset <- offset + length(s[s == i])
}#AUC
roc(obs.outputs.icu, pred.outputs.glm.icu) # glm with cross validation## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
##
## Call:
## roc.default(response = obs.outputs.icu, predictor = pred.outputs.glm.icu)
##
## Data: pred.outputs.glm.icu in 3900 controls (obs.outputs.icu 1) < 585 cases (obs.outputs.icu 2).
## Area under the curve: 0.6651
roc(obs.outputs.icu, pred.outputs.rf.icu, ci = TRUE) # rf with cross validation## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
##
## Call:
## roc.default(response = obs.outputs.icu, predictor = pred.outputs.rf.icu, ci = TRUE)
##
## Data: pred.outputs.rf.icu in 3900 controls (obs.outputs.icu 1) < 585 cases (obs.outputs.icu 2).
## Area under the curve: 0.6481
## 95% CI: 0.6244-0.6718 (DeLong)
#ROC Curves
plot.roc(obs.outputs.icu, pred.outputs.glm.icu, col = "darkred") # Glm, k fold cross validated## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
plot.roc(obs.outputs.icu, pred.outputs.rf.icu, ci = TRUE, col = "navy", add = TRUE) # RF, K-fold cross validated## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
legend("bottomright", legend = c("glm cross-validation", "rf cross-validation"), col = c("darkred", "navy"), lwd = 1)Figure 5 ROC Curves for the ICU Dataset with Continuous Data
# PR Curves
pr.icu <- pr.curve(pred.outputs.glm.icu[obs.outputs.icu == "2"],
pred.outputs.glm.icu[obs.outputs.icu == "1"], curve=TRUE)
plot(pr.icu)Figure 6 PR Curve for the Logistic Regression Model for the ICU Dataset with Continuous Data
prrf.icu <- pr.curve(pred.outputs.rf.icu[obs.outputs.icu == "2"],
pred.outputs.rf.icu[obs.outputs.icu == "1"], curve=TRUE)
plot(prrf.icu)Figure 7 PR Curve for the Random Forest Model for the ICU Dataset with Continuous Data
N = nrow(nonicu)
K = 10
set.seed(1234)
s = sample(1:K, size = N, replace = T)
pred.outputs.glm.nonicu <- vector(mode = "numeric", length = N)
pred.outputs.rf.nonicu <- vector(mode = "numeric", length = N)
obs.outputs.nonicu <- vector(mode = "numeric", length = N)
offset <- 0
for(i in 1:K){
train <-filter(nonicu, s != i)
test <- filter(nonicu, s == i)
obs.outputs.nonicu[1:length(s[s == i]) + offset] <- test$aki
#GLM train/test
glm.aki.nonicu <- glm(aki ~ ., data = train, family = binomial(logit))
glm.predictions.nonicu <- predict(glm.aki.nonicu, test, type = "response")
pred.outputs.glm.nonicu[1:length(s[s == i]) + offset] <- glm.predictions.nonicu
#RF train/test
rf.nonicu <- randomForest(aki ~ ., data = train, ntree = 100, importance = TRUE)
rf.pred.curr.nonicu <- predict(rf.nonicu, newdata = test, type = "prob")
pred.outputs.rf.nonicu[1:length(s[s == i]) + offset] <- rf.pred.curr.nonicu[ , 2]
offset <- offset + length(s[s == i])
}#AUC
roc(obs.outputs.nonicu, pred.outputs.glm.nonicu) # glm with cross validation## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
##
## Call:
## roc.default(response = obs.outputs.nonicu, predictor = pred.outputs.glm.nonicu)
##
## Data: pred.outputs.glm.nonicu in 21164 controls (obs.outputs.nonicu 1) < 1867 cases (obs.outputs.nonicu 2).
## Area under the curve: 0.687
roc(obs.outputs.nonicu, pred.outputs.rf.nonicu, ci = TRUE) # rf with cross validation## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
##
## Call:
## roc.default(response = obs.outputs.nonicu, predictor = pred.outputs.rf.nonicu, ci = TRUE)
##
## Data: pred.outputs.rf.nonicu in 21164 controls (obs.outputs.nonicu 1) < 1867 cases (obs.outputs.nonicu 2).
## Area under the curve: 0.6765
## 95% CI: 0.6638-0.6893 (DeLong)
#ROC Curves
plot.roc(obs.outputs.nonicu, pred.outputs.glm.nonicu, col = "darkred") # Glm, k fold cross validated## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
plot.roc(obs.outputs.nonicu, pred.outputs.rf.nonicu, ci = TRUE, col = "navy", add = TRUE) # RF, K-fold cross validated## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
legend("bottomright", legend = c("glm cross-validation", "rf cross-validation"), col = c("darkred", "navy"), lwd = 1)Figure 8 ROC Curves for the non-ICU Dataset with Continuous Data
# PR Curves
pr.nonicu <- pr.curve(pred.outputs.glm.nonicu[obs.outputs.nonicu == "2"],
pred.outputs.glm.nonicu[obs.outputs.nonicu == "1"], curve=TRUE)
plot(pr.nonicu)Figure 9 PR Curve for the Logistic Regression Model for the non-ICU Dataset with Continuous Data
prrf.nonicu <- pr.curve(pred.outputs.rf.nonicu[obs.outputs.nonicu == "2"],
pred.outputs.rf.nonicu[obs.outputs.nonicu == "1"], curve=TRUE)
plot(prrf.nonicu)Figure 10 PR Curve for the Random Forest Model for the non-ICU Dataset with Continuous Data
ROC performance decreased when the data was split between ICU and Non ICU, vs when data was combined. However, PRC performance increased for the ICU group. But were the top predictors different?
First, identify top predictors for the complete dataset.
# Identify top predictors for the combined data set
# Logistic Regression Importance
glm.aki.sum <- summary(glm.aki)
glm.aki.z <- data.frame(glm.aki.sum$coefficients[,3]) # make df of z-scores
glm.aki.z <- rownames_to_column(glm.aki.z, "feature") # convert rownames to a column
glm.aki.z <- glm.aki.z[!(glm.aki.z$feature=="(Intercept)"),] # drop intercept
colnames(glm.aki.z)[2] <- "z" # rename column to z
glm.aki.z$col.flag <- glm.aki.z$z>0 # Create a color flag
ggplot(glm.aki.z, aes(x = reorder(feature, z), y = z, fill = col.flag)) +
geom_bar(stat = "identity") +
coord_flip() +
scale_fill_manual(values = c("navy", "darkred"), name = NULL, labels = c("Protective", "Predictive"))+
labs(y="Relative Importance (z-score)", x = "Feature")Figure 11 Relative Importance in the Logistic Regression Model for the Complete Dataset with Continuous Data
#Get 10 Top Predictors by GLM
glm.aki.pvals <- data.frame(glm.aki.sum$coefficients[,4])
top.glm.aki.pvals <- glm.aki.pvals %>%
filter(rownames(glm.aki.pvals) != "(Intercept)") %>%
arrange(glm.aki.sum.coefficients...4.) %>%
slice_head(n=10)# Random Forest Importance
gini <- as.data.frame(rf$importance)
gini <- rownames_to_column(gini, "feature") # convert rownames to a column
ggplot(gini, aes(x = reorder(feature, MeanDecreaseGini) , y = MeanDecreaseGini))+
geom_bar(stat = "identity", fill = "darkred") +
coord_flip()+
labs(y="Mean Decrease in GINI Score", x = "Feature")Figure 12 Relative Importance in the Random Forest Model for the Complete Dataset with Continuous Data
# Get 10 top Predictors by RF (highest decrease in GINI)
top.rf.gini = gini %>%
arrange(desc(MeanDecreaseGini)) %>%
slice_head(n=10)Repeat the process for ICU subset.
# Identify top predictors for the ICU subset
# Logistic Regression Importance
glm.aki.sum.icu <- summary(glm.aki.icu)
glm.aki.z.icu <- data.frame(glm.aki.sum.icu$coefficients[,3]) # make df of zscores
glm.aki.z.icu <- rownames_to_column(glm.aki.z.icu, "feature") # convert rownames to a column
glm.aki.z.icu <- glm.aki.z.icu[!(glm.aki.z.icu$feature=="(Intercept)"),] # drop intercept
colnames(glm.aki.z.icu)[2] <- "z" # rename column to z
glm.aki.z.icu$col.flag <- glm.aki.z.icu$z>0 # Create a color flag
ggplot(glm.aki.z.icu, aes(x = reorder(feature, z), y = z, fill = col.flag)) +
geom_bar(stat = "identity") +
coord_flip() +
scale_fill_manual(values = c("navy", "darkred"), name = NULL, labels = c("Protective", "Predictive"))+
labs(y="Relative Importance (z-score)", x = "Feature")Figure 13 Relative Importance in the Logistic Regression Model for the ICU-Dataset with Continuous Data
#Get 10 Top Predictors by GLM
glm.aki.pvals.icu <- data.frame(glm.aki.sum.icu$coefficients[,4])
top.glm.aki.pvals.icu <- glm.aki.pvals.icu %>%
filter(rownames(glm.aki.pvals.icu) != "(Intercept)") %>%
arrange(glm.aki.sum.icu.coefficients...4.) %>%
slice_head(n=10)# Random Forest Importance
gini.icu <- as.data.frame(rf.icu$importance)
gini.icu <- rownames_to_column(gini.icu, "feature") # convert rownames to a column
ggplot(gini.icu, aes(x = reorder(feature, MeanDecreaseGini) , y = MeanDecreaseGini))+
geom_bar(stat = "identity", fill = "darkred") +
coord_flip()+
labs(y="Mean Decrease in GINI Score", x = "Feature")Figure 14 Relative Importance in the Random Forest Model for the ICU-Dataset with Continuous Data
# Get 10 top Predictors by RF (highest decrease in GINI)
top.rf.gini.icu = gini.icu %>%
arrange(desc(MeanDecreaseGini)) %>%
slice_head(n=10)Repeat the process for non-ICU subset.
# Identify top predictors for the icu sub set
# Logistic Regression Importance
glm.aki.sum.nonicu <- summary(glm.aki.nonicu)
glm.aki.z.nonicu <- data.frame(glm.aki.sum.nonicu$coefficients[,3]) # make df of zscores
glm.aki.z.nonicu <- rownames_to_column(glm.aki.z.nonicu, "feature") # convert rownames to a column
glm.aki.z.nonicu <- glm.aki.z.nonicu[!(glm.aki.z.nonicu$feature=="(Intercept)"),] # drop intercept
colnames(glm.aki.z.nonicu)[2] <- "z" # rename column to z
glm.aki.z.nonicu$col.flag <- glm.aki.z.nonicu$z>0 # Create a color flag
ggplot(glm.aki.z.nonicu, aes(x = reorder(feature, z), y = z, fill = col.flag)) +
geom_bar(stat = "identity") +
coord_flip() +
scale_fill_manual(values = c("navy", "darkred"), name = NULL, labels = c("Protective", "Predictive"))+
labs(y="Relative Importance (z-score)", x = "Feature")Figure 15 Relative Importance in the Logistic Regression Model for the non-ICU-Dataset with Continuous Data
#Get 10 Top Predictors by GLM
glm.aki.pvals.nonicu <- data.frame(glm.aki.sum.nonicu$coefficients[,4])
top.glm.aki.pvals.nonicu <- glm.aki.pvals.nonicu %>%
filter(rownames(glm.aki.pvals.nonicu) != "(Intercept)") %>%
arrange(glm.aki.sum.nonicu.coefficients...4.) %>%
slice_head(n=10)# Random Forest Importance
gini.nonicu <- as.data.frame(rf.nonicu$importance)
gini.nonicu <- rownames_to_column(gini.nonicu, "feature") # convert rownames to a column
ggplot(gini.nonicu, aes(x = reorder(feature, MeanDecreaseGini) , y = MeanDecreaseGini))+
geom_bar(stat = "identity", fill = "darkred") +
coord_flip()+
labs(y="Mean Decrease in GINI Score", x = "Feature")Figure 16 Relative Importance in the Random Forest Model for the non-ICU-Dataset with Continuous Data
# Get 10 top Predictors by RF (highest decrease in GINI)
top.rf.gini.nonicu = gini.nonicu %>%
arrange(desc(MeanDecreaseGini)) %>%
slice_head(n=10)Make a table that compare the top predictors for the different methods and the different datasets.
# Make a dataframe with all the results
top.compare.pval <-data.frame(rownames(top.glm.aki.pvals), rownames(top.glm.aki.pvals.nonicu), rownames(top.glm.aki.pvals.icu)) %>%
rename("combined" = 1, "non-ICU" =2, "ICU" = 3)
top.compare.pval## combined non-ICU ICU
## 1 diuretic1 fluid1 abxNTX1
## 2 fluid1 diuretic1 ckd1
## 3 ckd1 ckd1 bmi
## 4 bactrim1 bactrim1 ntxOther1
## 5 age age age
## 6 icuICU chf1 diuretic1
## 7 abxNTX1 wtLoss1 chf1
## 8 chf1 creatinine priorLos
## 9 bmi periOpPOD two afib1
## 10 pvd1 liver1 periOpPOD three
Table 3 Top 10 Predictors for the Logistic Regression Models on the Three Data Sub-sets
top.compare.gini <-data.frame(top.rf.gini$feature, top.rf.gini.nonicu$feature, top.rf.gini.icu$feature)
top.compare.gini## top.rf.gini.feature top.rf.gini.nonicu.feature top.rf.gini.icu.feature
## 1 indexGFR indexGFR indexGFR
## 2 bmi bmi creatinine
## 3 creatinine creatinine bmi
## 4 platelets platelets platelets
## 5 wbc wbc potassium
## 6 hgb hgb age
## 7 age age wbc
## 8 chloride chloride hgb
## 9 potassium potassium chloride
## 10 sodium sodium sodium
Table 4 Top 10 Predictors for the Random Forest Models on the Three Data Sub-sets
Discuss the results.
Analysis the continuous variables as categorized data.
N = nrow(data.c.f)
K = 10
set.seed(1234)
s = sample(1:K, size = N, replace = T)
pred.outputs.glm <- vector(mode = "numeric", length = N)
pred.outputs.rf <- vector(mode = "numeric", length = N)
obs.outputs <- vector(mode = "numeric", length = N)
offset <- 0
for(i in 1:K){
train <-filter(data.c.f.cat, s != i)
test <- filter(data.c.f.cat, s == i)
obs.outputs[1:length(s[s == i]) + offset] <- test$aki
#GLM train/test
glm.aki <- glm(aki ~ ., data = train, family = binomial(logit))
glm.predictions <- predict(glm.aki, test, type = "response")
pred.outputs.glm[1:length(s[s == i]) + offset] <- glm.predictions
#RF train/test
rf <- randomForest(aki ~ ., data = train, ntree = 100, importance = TRUE)
rf.pred.curr <- predict(rf, newdata = test, type = "prob")
pred.outputs.rf[1:length(s[s == i]) + offset] <- rf.pred.curr[ , 2]
offset <- offset + length(s[s == i])
}# AUC
roc(obs.outputs, pred.outputs.glm) # glm with cross validation## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
##
## Call:
## roc.default(response = obs.outputs, predictor = pred.outputs.glm)
##
## Data: pred.outputs.glm in 25064 controls (obs.outputs 1) < 2452 cases (obs.outputs 2).
## Area under the curve: 0.7051
roc(obs.outputs, pred.outputs.rf, ci = TRUE) # rf with cross validation## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
##
## Call:
## roc.default(response = obs.outputs, predictor = pred.outputs.rf, ci = TRUE)
##
## Data: pred.outputs.rf in 25064 controls (obs.outputs 1) < 2452 cases (obs.outputs 2).
## Area under the curve: 0.6808
## 95% CI: 0.6699-0.6917 (DeLong)
# ROC Curves
plot.roc(obs.outputs, pred.outputs.glm, col = "darkred") # Glm, k fold cross validated## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
plot.roc(obs.outputs, pred.outputs.rf, ci = TRUE, col = "navy", add = TRUE) # RF, K-fold cross validated## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
legend("bottomright", legend = c("glm cross-validation", "rf cross-validation"), col = c("darkred", "navy"), lwd = 1)Figure 17 ROC Curves for the Complete Dataset with Categorical Data
# PR Curves
pr <- pr.curve(pred.outputs.glm[obs.outputs == "2"],
pred.outputs.glm[obs.outputs == "1"], curve=TRUE)
plot(pr)Figure 18 PR Curves for the Logistic Regression on the Complete Dataset with Categorical Data
prrf <- pr.curve(pred.outputs.rf[obs.outputs == "2"],
pred.outputs.rf[obs.outputs == "1"], curve=TRUE)
plot(prrf)Figure 19 PR Curves for the Random Forest Model on the Complete Dataset with Categorical Data
N = nrow(icu)
K = 10
set.seed(1234)
s = sample(1:K, size = N, replace = T)
pred.outputs.glm.icu <- vector(mode = "numeric", length = N)
pred.outputs.rf.icu <- vector(mode = "numeric", length = N)
obs.outputs.icu <- vector(mode = "numeric", length = N)
offset <- 0
for(i in 1:K){
train <-filter(icu.cat, s != i)
test <- filter(icu.cat, s == i)
obs.outputs.icu[1:length(s[s == i]) + offset] <- test$aki
#GLM train/test
glm.aki.icu <- glm(aki ~ ., data = train, family = binomial(logit))
glm.predictions.icu <- predict(glm.aki.icu, test, type = "response")
pred.outputs.glm.icu[1:length(s[s == i]) + offset] <- glm.predictions.icu
#RF train/test
rf.icu <- randomForest(aki ~ ., data = train, ntree = 100, importance = TRUE)
rf.pred.curr.icu <- predict(rf.icu, newdata = test, type = "prob")
pred.outputs.rf.icu[1:length(s[s == i]) + offset] <- rf.pred.curr.icu[ , 2]
offset <- offset + length(s[s == i])
}#AUC
roc(obs.outputs.icu, pred.outputs.glm.icu) # glm with cross validation## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
##
## Call:
## roc.default(response = obs.outputs.icu, predictor = pred.outputs.glm.icu)
##
## Data: pred.outputs.glm.icu in 3900 controls (obs.outputs.icu 1) < 585 cases (obs.outputs.icu 2).
## Area under the curve: 0.6838
roc(obs.outputs.icu, pred.outputs.rf.icu, ci = TRUE) # rf with cross validation## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
##
## Call:
## roc.default(response = obs.outputs.icu, predictor = pred.outputs.rf.icu, ci = TRUE)
##
## Data: pred.outputs.rf.icu in 3900 controls (obs.outputs.icu 1) < 585 cases (obs.outputs.icu 2).
## Area under the curve: 0.6737
## 95% CI: 0.6512-0.6961 (DeLong)
#ROC Curves
plot.roc(obs.outputs.icu, pred.outputs.glm.icu, col = "darkred") # Glm, k fold cross validated## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
plot.roc(obs.outputs.icu, pred.outputs.rf.icu, ci = TRUE, col = "navy", add = TRUE) # RF, K-fold cross validated## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
legend("bottomright", legend = c("glm cross-validation", "rf cross-validation"), col = c("darkred", "navy"), lwd = 1)Figure 20 ROC Curves for the ICU Dataset with Categorical Data
# PR Curves
pr.icu <- pr.curve(pred.outputs.glm.icu[obs.outputs.icu == "2"],
pred.outputs.glm.icu[obs.outputs.icu == "1"], curve=TRUE)
plot(pr.icu)Figure 21 PR Curves for the Logistic Regression Model on the ICU Dataset with Categorical Data
prrf.icu <- pr.curve(pred.outputs.rf.icu[obs.outputs.icu == "2"],
pred.outputs.rf.icu[obs.outputs.icu == "1"], curve=TRUE)
plot(prrf.icu)Figure 22 PR Curves for the Random Forest Model on the ICU Dataset with Categorical Data
N = nrow(nonicu)
K = 10
set.seed(1234)
s = sample(1:K, size = N, replace = T)
pred.outputs.glm.nonicu <- vector(mode = "numeric", length = N)
pred.outputs.rf.nonicu <- vector(mode = "numeric", length = N)
obs.outputs.nonicu <- vector(mode = "numeric", length = N)
offset <- 0
for(i in 1:K){
train <-filter(nonicu.cat, s != i)
test <- filter(nonicu.cat, s == i)
obs.outputs.nonicu[1:length(s[s == i]) + offset] <- test$aki
#GLM train/test
glm.aki.nonicu <- glm(aki ~ ., data = train, family = binomial(logit))
glm.predictions.nonicu <- predict(glm.aki.nonicu, test, type = "response")
pred.outputs.glm.nonicu[1:length(s[s == i]) + offset] <- glm.predictions.nonicu
#RF train/test
rf.nonicu <- randomForest(aki ~ ., data = train, ntree = 100, importance = TRUE)
rf.pred.curr.nonicu <- predict(rf.nonicu, newdata = test, type = "prob")
pred.outputs.rf.nonicu[1:length(s[s == i]) + offset] <- rf.pred.curr.nonicu[ , 2]
offset <- offset + length(s[s == i])
}#AUC
roc(obs.outputs.nonicu, pred.outputs.glm.nonicu) # glm with cross validation## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
##
## Call:
## roc.default(response = obs.outputs.nonicu, predictor = pred.outputs.glm.nonicu)
##
## Data: pred.outputs.glm.nonicu in 21164 controls (obs.outputs.nonicu 1) < 1867 cases (obs.outputs.nonicu 2).
## Area under the curve: 0.698
roc(obs.outputs.nonicu, pred.outputs.rf.nonicu, ci = TRUE) # rf with cross validation## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
##
## Call:
## roc.default(response = obs.outputs.nonicu, predictor = pred.outputs.rf.nonicu, ci = TRUE)
##
## Data: pred.outputs.rf.nonicu in 21164 controls (obs.outputs.nonicu 1) < 1867 cases (obs.outputs.nonicu 2).
## Area under the curve: 0.6723
## 95% CI: 0.6597-0.6849 (DeLong)
#ROC Curves
plot.roc(obs.outputs.nonicu, pred.outputs.glm.nonicu, col = "darkred") # Glm, k fold cross validated## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
plot.roc(obs.outputs.nonicu, pred.outputs.rf.nonicu, ci = TRUE, col = "navy", add = TRUE) # RF, K-fold cross validated## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
legend("bottomright", legend = c("glm cross-validation", "rf cross-validation"), col = c("darkred", "navy"), lwd = 1)Figure 23 ROC Curves for the non-ICU Dataset with Categorical Data
# PR Curves
pr.nonicu <- pr.curve(pred.outputs.glm.nonicu[obs.outputs.nonicu == "2"],
pred.outputs.glm.nonicu[obs.outputs.nonicu == "1"], curve=TRUE)
plot(pr.nonicu)Figure 24 PR Curve for the Logistic Regression Model on the non-ICU Dataset with Categorical Data
prrf.nonicu <- pr.curve(pred.outputs.rf.nonicu[obs.outputs.nonicu == "2"],
pred.outputs.rf.nonicu[obs.outputs.nonicu == "1"], curve=TRUE)
plot(prrf.nonicu)Figure 23 PR Curve for the Random Forest Model on the non-ICU Dataset with Categorical Data
# Identify top predictors for the combined data set with categorized variables
# Logistic Regression Importance
glm.aki.sum <- summary(glm.aki)
glm.aki.z <- data.frame(glm.aki.sum$coefficients[,3]) # make df of z-scores
glm.aki.z <- rownames_to_column(glm.aki.z, "feature") # convert rownames to a column
glm.aki.z <- glm.aki.z[!(glm.aki.z$feature=="(Intercept)"),] # drop intercept
colnames(glm.aki.z)[2] <- "z" # rename column to z
glm.aki.z$col.flag <- glm.aki.z$z>0 # Create a color flag
ggplot(glm.aki.z, aes(x = reorder(feature, z), y = z, fill = col.flag)) +
geom_bar(stat = "identity") +
coord_flip() +
scale_fill_manual(values = c("navy", "darkred"), name = NULL, labels = c("Protective", "Predictive"))+
labs(y="Relative Importance (z-score)", x = "Feature")Figure 24 Relative Importance in the Logistic Regression Model for the Complete Dataset with Categorical Data
#Get 10 Top Predictors by GLM
glm.aki.pvals <- data.frame(glm.aki.sum$coefficients[,4])
top.glm.aki.pvals <- glm.aki.pvals %>%
filter(rownames(glm.aki.pvals) != "(Intercept)") %>%
arrange(glm.aki.sum.coefficients...4.) %>%
slice_head(n=10)# Random Forest Importance
gini <- as.data.frame(rf$importance)
gini <- rownames_to_column(gini, "feature") # convert rownames to a column
ggplot(gini, aes(x = reorder(feature, MeanDecreaseGini) , y = MeanDecreaseGini))+
geom_bar(stat = "identity", fill = "darkred") +
coord_flip()+
labs(y="Mean Decrease in GINI Score", x = "Feature")Figure 25 Relative Importance in the Random Forest Model for the Complete Dataset with Categorical Data
# Get 10 top Predictors by RF (highest decrease in GINI)
top.rf.gini = gini %>%
arrange(desc(MeanDecreaseGini)) %>%
slice_head(n=10)Repeat process for ICU subset with categorized data.
# Identify top predictors for the icu sub set
# Logistic Regression Importance
glm.aki.sum.icu <- summary(glm.aki.icu)
glm.aki.z.icu <- data.frame(glm.aki.sum.icu$coefficients[,3]) # make df of zscores
glm.aki.z.icu <- rownames_to_column(glm.aki.z.icu, "feature") # convert rownames to a column
glm.aki.z.icu <- glm.aki.z.icu[!(glm.aki.z.icu$feature=="(Intercept)"),] # drop intercept
colnames(glm.aki.z.icu)[2] <- "z" # rename column to z
glm.aki.z.icu$col.flag <- glm.aki.z.icu$z>0 # Create a color flag
ggplot(glm.aki.z.icu, aes(x = reorder(feature, z), y = z, fill = col.flag)) +
geom_bar(stat = "identity") +
coord_flip() +
scale_fill_manual(values = c("navy", "darkred"), name = NULL, labels = c("Protective", "Predictive"))+
labs(y="Relative Importance (z-score)", x = "Feature")Figure 26 Relative Importance in the Logistic Regression Model for the ICU Dataset with Categorical Data
#Get 10 Top Predictors by GLM
glm.aki.pvals.icu <- data.frame(glm.aki.sum.icu$coefficients[,4])
top.glm.aki.pvals.icu <- glm.aki.pvals.icu %>%
filter(rownames(glm.aki.pvals.icu) != "(Intercept)") %>%
arrange(glm.aki.sum.icu.coefficients...4.) %>%
slice_head(n=10)# Random Forest Importance
gini.icu <- as.data.frame(rf.icu$importance)
gini.icu <- rownames_to_column(gini.icu, "feature") # convert rownames to a column
ggplot(gini.icu, aes(x = reorder(feature, MeanDecreaseGini) , y = MeanDecreaseGini))+
geom_bar(stat = "identity", fill = "darkred") +
coord_flip()+
labs(y="Mean Decrease in GINI Score", x = "Feature")Figure 27 Relative Importance in the Random Forest Model for the ICU Dataset with Categorical Data
# Get 10 top Predictors by RF (highest decrease in GINI)
top.rf.gini.icu = gini.icu %>%
arrange(desc(MeanDecreaseGini)) %>%
slice_head(n=10)Repeat process for non-ICU subset with categorized data.
# Identify top predictors for the icu sub set
# Logistic Regression Importance
glm.aki.sum.nonicu <- summary(glm.aki.nonicu)
glm.aki.z.nonicu <- data.frame(glm.aki.sum.nonicu$coefficients[,3]) # make df of zscores
glm.aki.z.nonicu <- rownames_to_column(glm.aki.z.nonicu, "feature") # convert rownames to a column
glm.aki.z.nonicu <- glm.aki.z.nonicu[!(glm.aki.z.nonicu$feature=="(Intercept)"),] # drop intercept
colnames(glm.aki.z.nonicu)[2] <- "z" # rename column to z
glm.aki.z.nonicu$col.flag <- glm.aki.z.nonicu$z>0 # Create a color flag
ggplot(glm.aki.z.nonicu, aes(x = reorder(feature, z), y = z, fill = col.flag)) +
geom_bar(stat = "identity") +
coord_flip() +
scale_fill_manual(values = c("navy", "darkred"), name = NULL, labels = c("Protective", "Predictive"))+
labs(y="Relative Importance (z-score)", x = "Feature")Figure 28 Relative Importance in the Logistic Regression Model for the non-ICU Dataset with Categorical Data
#Get 10 Top Predictors by GLM
glm.aki.pvals.nonicu <- data.frame(glm.aki.sum.nonicu$coefficients[,4])
top.glm.aki.pvals.nonicu <- glm.aki.pvals.nonicu %>%
filter(rownames(glm.aki.pvals.nonicu) != "(Intercept)") %>%
arrange(glm.aki.sum.nonicu.coefficients...4.) %>%
slice_head(n=10)# Random Forest Importance
gini.nonicu <- as.data.frame(rf.nonicu$importance)
gini.nonicu <- rownames_to_column(gini.nonicu, "feature") # convert rownames to a column
ggplot(gini.nonicu, aes(x = reorder(feature, MeanDecreaseGini) , y = MeanDecreaseGini))+
geom_bar(stat = "identity", fill = "darkred") +
coord_flip()+
labs(y="Mean Decrease in GINI Score", x = "Feature")Figure 29 Relative Importance in the Random Forest Model for the non-ICU Dataset with Categorical Data
# Get 10 top Predictors by RF (highest decrease in GINI)
top.rf.gini.nonicu = gini.nonicu %>%
arrange(desc(MeanDecreaseGini)) %>%
slice_head(n=10)Make a table that compare the top predictors for the different methods and the different datasets using the categorized data.
# Make a dataframe with all the results
top.compare.pval <-data.frame(rownames(top.glm.aki.pvals), rownames(top.glm.aki.pvals.nonicu), rownames(top.glm.aki.pvals.icu)) %>%
rename("combined" = 1, "non-ICU" =2, "ICU" = 3)
top.compare.pval## combined non-ICU ICU
## 1 creatinineref creatinineref creatinineref
## 2 creatininehigh fluid1 creatininehigh
## 3 diuretic1 diuretic1 abxNTX1
## 4 bactrim1 creatininehigh diuretic1
## 5 fluid1 bactrim1 ckd1
## 6 priorLos>7 ckd1 priorLos>7
## 7 ckd1 chf1 ntxOther1
## 8 icuICU priorLos>7 chf1
## 9 priorLos3-7 raceWhite indexGFRG2
## 10 abxNTX1 periOpPOD two indexGFRG1
Table 5 Top 10 Predictors for the Logistic Regression Models on the Three Data Sub-sets using Categorical Data
top.compare.gini <-data.frame(top.rf.gini$feature, top.rf.gini.nonicu$feature, top.rf.gini.icu$feature)
top.compare.gini## top.rf.gini.feature top.rf.gini.nonicu.feature top.rf.gini.icu.feature
## 1 indexGFR bmi indexGFR
## 2 bmi indexGFR bmi
## 3 priorLos priorLos priorLos
## 4 race race periOp
## 5 dm dm race
## 6 age age dm
## 7 chloride wbc chloride
## 8 periOp chloride age
## 9 platelets platelets platelets
## 10 wbc sodium sodium
Table 6 Top 10 Predictors for the Random Forest Models on the Three Data Sub-sets Using Categorical Data # Discussion
Goyal A, Daneshpajouhnejad P, Hashmi MF, et al. Acute Kidney Injury. [Updated 2022 Aug 18]. In: StatPearls [Internet]. Treasure Island (FL): StatPearls Publishing; 2022 Jan-. Available from: https://www.ncbi.nlm.nih.gov/books/NBK441896/.
De Vlieger, Greeta; Kashani, Kianoushb; Meyfroidt, Geerta. Artificial intelligence to guide management of acute kidney injury in the ICU: a narrative review. Current Opinion in Critical Care: December 2020 - Volume 26 - Issue 6 - p 563-573 doi: 10.1097/MCC.0000000000000775.
Miano, Todd A., et al. “Effect of renin-angiotensin system inhibitors on the comparative nephrotoxicity of NSAIDs and opioids during hospitalization.” Kidney360 1.7 (2020): 604.
bmi cutpoint: https://www.cdc.gov/healthyweight/assessing/bmi/adult_bmi/index.html#InterpretedAdults
Kratz A, Pesce MA, Basner RC, Einstein AJ. Laboratory Values of Clinical Importance. In: Kasper D, Fauci A, Hauser S, Longo D, Jameson J, Loscalzo J. eds. Harrison’s Principles of Internal Medicine, 19e. McGraw Hill; 2014. Accessed November 23, 2022. https://accessmedicine-mhmedical-com.proxy.library.upenn.edu/content.aspx?bookid=1130§ionid=79722706
Nembrini S, König IR, Wright MN. The revival of the Gini importance? Bioinformatics. 2018 Nov 1;34(21):3711-3718. doi: 10.1093/bioinformatics/bty373. PMID: 29757357; PMCID: PMC6198850.
Strobl C, Boulesteix AL, Zeileis A, Hothorn T. Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinformatics. 2007 Jan 25;8:25. doi: 10.1186/1471-2105-8-25. PMID: 17254353; PMCID: PMC1796903.